Search for: All records

Creators/Authors contains: "Savelka, Jaromir"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not be available free of charge during the embargo period.

Some links on this page may take you to non-federal websites, whose policies may differ from those of this site.

  1. Code completion problems are an effective type of formative assessment, especially when used to practice newly learned concepts or topics. While there is a growing body of research in computing education on using large language models (LLMs) to support learning content development, the use of LLMs to produce high-quality code completion problems has not yet been explored. In this paper, we analyze the capability of LLMs to generate effective distractors (i.e., plausible but incorrect options) and explanations for completion problems, and we use common student misconceptions to improve the quality of the generated distractors. Our study suggests that LLMs can generate reasonable distractors and explanations. At the same time, we identify the lack of a sufficiently granular taxonomy of common student misconceptions, which would be needed to align the generated distractors with common misconceptions and errors, as a gap that should be addressed in future work. (A minimal sketch of this kind of distractor generation appears after this list.)
    Free, publicly-accessible full text available May 14, 2026
  2. As large language models (LLMs) show great promise for generating a wide spectrum of educational materials, robust yet cost-effective assessment of the quality and effectiveness of such materials becomes an important challenge. Traditional approaches, including expert-based quality assessment and student-centered evaluation, are resource-intensive and do not scale efficiently. In this work, we explored the use of pre-existing student learning data as a promising approach to evaluating LLM-generated learning materials. Specifically, we used a dataset in which students completed program construction challenges by picking the correct answers from among human-authored distractors, and we used it to evaluate the quality of LLM-generated distractors for the same challenges. The dataset included responses from 1,071 students across 22 classes taught from Fall 2017 to Spring 2023. We evaluated five prominent LLMs (OpenAI-o1, GPT-4, GPT-4o, GPT-4o-mini, and Llama-3.1-8b) across three different prompts to see which combinations produce more effective distractors, i.e., those that are plausible (often picked by students) and potentially based on common misconceptions. Our results suggest that GPT-4o was the most effective model, matching close to 50% of the functional distractors originally authored by humans. At the same time, all of the evaluated LLMs generated many novel distractors that did not match the pre-existing human-authored ones; our preliminary analysis suggests these are promising, and establishing their effectiveness in real-world classroom settings is left for future work. (A sketch of the distractor-matching computation appears after this list.)
    Free, publicly-accessible full text available March 3, 2026
  3. Automated grading systems, or auto-graders, have become ubiquitous in programming education, and the way they generate feedback has become increasingly automated as well. However, there is insufficient evidence of auto-grader feedback's effectiveness in improving student learning outcomes in a way that differentiates between students who used the feedback and students who did not. In this study, we fill this critical gap. Specifically, we analyze students' interactions with auto-graders in an introductory Python programming course offered at five community colleges in the United States. Our results show that students who check the feedback more frequently tend to earn higher scores on their programming assignments overall, and that a submission following a feedback check tends to receive a higher score than a submission following ignored feedback. These results provide evidence of auto-grader feedback's effectiveness, encourage its increased utilization, and call for future work to continue its evaluation in this age of automation. (A sketch of this kind of comparison appears after this list.)
    Free, publicly-accessible full text available January 1, 2026
  4. Generative AI (GenAI) is advancing rapidly, and the computing education literature is expanding almost as quickly. Initial responses to GenAI tools ranged from panic to utopian optimism, and many were quick to point out both the opportunities and the challenges. Researchers reported that these new tools can solve most introductory programming tasks and are causing disruptions throughout the curriculum. The tools can write and explain code, enhance error messages, create resources for instructors, and even provide feedback and help to students like a traditional teaching assistant. In 2024, new research began to emerge on the effects of GenAI usage in the computing classroom. These new data involve the use of GenAI to support classroom instruction at scale and to teach students how to code with GenAI. In support of the former, a new class of tools is emerging that can provide personalized feedback to students on their programming assignments or teach programming and prompting skills at the same time. With the literature expanding so rapidly, this report aims to summarize and explain what is happening on the ground in computing classrooms. We provide a systematic literature review; a survey of educators and industry professionals; and interviews with educators using GenAI in their courses, educators studying GenAI, and researchers who create GenAI tools to support computing education. The triangulation of these methods and data sources expands the understanding of GenAI usage and perceptions at this critical moment for our community.
    Free, publicly-accessible full text available January 22, 2026
  5. Legal texts routinely use concepts that are difficult to understand. Lawyers elaborate on the meaning of such concepts by, among other things, carefully investigating how they have been used in the past. Finding text snippets that mention a particular concept in a useful way is tedious, time-consuming, and hence expensive. We assembled a dataset of 26,959 sentences drawn from legal case decisions and labeled them in terms of their usefulness for explaining selected legal concepts. Using the dataset, we study the effectiveness of transformer models pre-trained on large language corpora at detecting which of the sentences are useful. In light of the models' predictions, we analyze various linguistic properties of the explanatory sentences as well as their relationship to the legal concept that needs to be explained. We show that the transformer-based models are capable of learning surprisingly sophisticated features and outperform prior approaches to the task. (A sketch of such a sentence classifier appears after this list.)
  6. Maranhao, Juliano; Wyner, Adam (Eds.)
    In this paper, we assess the use of several deep learning classification algorithms as a step toward automatically preparing succinct summaries of legal decisions. Short case summaries that tease out a decision's argument structure by making explicit its issues, conclusions, and reasons (i.e., argument triples) could make it easier for the lay public and legal professionals to gain insight into what a case is about. We obtained a sizeable dataset of expert-crafted case summaries paired with the full texts of decisions issued by various Canadian courts. As manual annotation of the full texts is prohibitively expensive, we explore ways of leveraging the existing longer summaries, which are much less time-consuming to annotate. We compare the performance of systems trained on annotations manually ported from the summaries to the full texts against the same systems trained on annotations projected from the summaries automatically. The results show that pursuing automatic annotation in the future is feasible. (A sketch of such an automatic projection appears after this list.)
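
Illustrative sketch for item 1: one plausible way to prompt an LLM for misconception-grounded distractors, assuming the OpenAI Python SDK. The model name, prompt wording, and misconception list are illustrative assumptions, not the authors' exact setup.

    # Minimal sketch of misconception-grounded distractor generation.
    # Prompt wording and misconception list are hypothetical.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    MISCONCEPTIONS = [
        "off-by-one errors in range() bounds",
        "confusing assignment (=) with comparison (==)",
        "assuming integer division (//) returns a float",
    ]

    def generate_distractors(problem: str, correct_line: str, n: int = 3) -> str:
        """Ask the model for plausible-but-incorrect completions plus explanations."""
        prompt = (
            f"Code completion problem:\n{problem}\n\n"
            f"Correct completion: {correct_line}\n\n"
            f"Write {n} incorrect but plausible completions (distractors), each "
            "grounded in one of these common student misconceptions:\n"
            + "\n".join(f"- {m}" for m in MISCONCEPTIONS)
            + "\nAfter each distractor, explain in one sentence why it is wrong."
        )
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content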
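
Illustrative sketch for item 2: estimating how many human-authored distractors the generated ones match, after light normalization. The normalization rules are assumptions; the paper's matching criterion for "functional" distractors may differ.

    # Minimal sketch: match rate between generated and human-authored distractors.
    import re

    def normalize(code: str) -> str:
        """Drop comments and collapse whitespace so trivially different
        spellings of the same distractor compare equal."""
        code = re.sub(r"#.*", "", code)
        return re.sub(r"\s+", " ", code).strip()

    def match_rate(generated: list[str], human: list[str]) -> float:
        """Fraction of human-authored distractors reproduced by the model."""
        gen = {normalize(g) for g in generated}
        hum = {normalize(h) for h in human}
        return len(gen & hum) / len(hum) if hum else 0.0

    # Two of the three human distractors are reproduced -> about 0.67.
    print(match_rate(
        ["for i in range(n+1):", "for i in range(1, n):", "while i < n:"],
        ["for i in range(n+1):", "for i in  range(1, n):", "for i in range(n-1):"],
    ))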
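
Illustrative sketch for item 3: comparing mean submission scores depending on whether the preceding feedback message was checked. The column names and toy data are assumptions about the interaction-log layout, not the study's actual schema.

    # Minimal sketch: score after checking vs. ignoring auto-grader feedback.
    import pandas as pd

    subs = pd.DataFrame({
        "student": ["a", "a", "a", "b", "b", "b"],
        "checked_prev_feedback": [False, True, True, False, False, True],
        "score": [40, 70, 95, 35, 45, 80],
    })

    # Mean score of submissions that follow a feedback check vs. those that do not.
    print(subs.groupby("checked_prev_feedback")["score"].mean())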
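
Illustrative sketch for item 5: one training step of a pre-trained transformer fine-tuned to classify sentences as useful or not for explaining a legal concept, assuming the Hugging Face transformers library. The base model, example labels, and learning rate are assumptions, not the paper's configuration.

    # Minimal sketch: binary usefulness classifier over legal sentences.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2  # 0 = not useful, 1 = useful
    )

    sentences = [
        "The court defined 'due diligence' as the care a prudent person exercises.",
        "The hearing was adjourned until Monday.",
    ]
    labels = torch.tensor([1, 0])

    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

    model.train()
    out = model(**enc, labels=labels)  # cross-entropy loss over the two classes
    out.loss.backward()
    optimizer.step()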
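
Illustrative sketch for item 6: projecting an annotation from a summary sentence onto the most similar sentence in the full decision text. The character-level similarity measure (difflib) is an assumption; the paper may use a different alignment method.

    # Minimal sketch: project a summary annotation onto the full text.
    from difflib import SequenceMatcher

    def project(summary_sentence: str, full_text_sentences: list[str]) -> str:
        """Return the full-text sentence most similar to the annotated one."""
        return max(
            full_text_sentences,
            key=lambda s: SequenceMatcher(None, summary_sentence, s).ratio(),
        )

    full = [
        "The appellant argued that the contract was void.",
        "The court concluded that the appeal must be dismissed.",
        "Costs were awarded to the respondent.",
    ]
    print(project("The appeal was dismissed by the court.", full))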